This report was prepared for ETC5521 Assignment 1 by Team numbat, comprising Aarathy Babu, Lachlan Moody, Dilinie Seimon, and Jinhao Luo.
2020 was a bad year for passwords. A recent audit of the ‘dark web’ reported on by Forbes revealed that over 15 billion stolen logins were circulating online (Winder, 2020). As stated in the article, for perspective, this represents two sets of account logins for every person on the planet.
This was the result of more than 100,000 data breaches relating to cyber crime activities, a 300% increase since 2018. So in an age where everybody is leaving an ever-growing digital record of their activities, from social media to banking, what can the average person do to bolster their security online?
The following analysis explores this issue in depth using a compilation of some of the most commonly used passwords on the web. It should be noted, however, that the original data was compiled in September 2014. The trends and findings discussed below may therefore not be entirely applicable today, and a more up-to-date collection would be required to ensure full relevance. Nevertheless, it is reasonable to assume that the underlying foundations of password security have not changed much in the past few years. Additionally, the strength rating provided is calculated relative to the other passwords in the data set. As laid out in the provided documentation, these common passwords are mostly all ‘bad’, so a high strength rating does not necessarily indicate that a password is hard to crack. However, the data set includes additional variables, namely the online and offline crack times, that allow this to be assessed. Detailed information on the data used and the research questions formulated is provided in the following section.
Based upon the motivations discussed above, the following research questions were formulated. The primary question of interest was:
What are the characteristics of the most common passwords in the interest of security?
Once this exploration area was established, five questions were composed to guide the subsequent analysis. They were:
A further analysis of the relationship between the online and offline cracking times of each password will be carried out in order to understand the underlying factors that might affect them.
The relationship between the types of characters included in a password (numbers, special characters, uppercase and lowercase letters, or a combination of these) and the strength of the password will also be analysed, which might give a clearer understanding of what drives the strength of a password.
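One way the character types in a password could be identified is with regular expressions. The sketch below is illustrative only: `classify_chars` and its four labels are hypothetical names introduced here, not part of the original analysis, and the example passwords are made up for demonstration.

```r
# Hypothetical helper: label each password by the kinds of characters it contains.
# The four labels are an assumption made for this sketch.
classify_chars <- function(pw) {
  has_digit   <- grepl("[0-9]", pw)
  has_letter  <- grepl("[A-Za-z]", pw)
  has_special <- grepl("[^A-Za-z0-9]", pw)
  ifelse(has_special, "contains special characters",
    ifelse(has_digit & has_letter, "letters and digits",
      ifelse(has_digit, "digits only", "letters only")))
}

# Made-up examples, one per label
examples <- c("password", "123456", "abc123", "p@ss")
char_type <- classify_chars(examples)
data.frame(password = examples, char_type = char_type)
```

The resulting label could then be compared against the `strength` variable, for example with a grouped summary.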
In order to address these areas and explore the field in greater depth, data was sourced from the book Knowledge is Beautiful (2014). The raw file contained 507 rows of password information derived from the online databases Skullsecurity and DigiNinja, collected in 2014. The data was provided in a tidy format and was read into RStudio as a csv file directly from the GitHub repository provided by Tidy Tuesday (2020) using the readr (2018) package. Table 2.1 describes each variable included in the dataset.
| Variable | Description |
|---|---|
| rank | popularity of password |
| password | actual text of the password |
| category | password type category |
| value | time to crack password by online guessing |
| time_unit | unit of time for corresponding value |
| offline_crack_sec | time to crack offline in seconds |
| rank_alt | alternative value for rank (same value as rank in all cases) |
| strength | relative strength of password from 1 to 10 |
| font_size | used externally to create graphic for Knowledge is Beautiful (2014) |
A visualization of the data structure, created using the visdat package (2017), can be seen below in Figure 2.1.
Figure 2.1: Initial Data Structure
Figure 2.2: Missing Data Values
On further investigation there appeared to be 7 blank rows at the end of the dataset. These observations were removed using dplyr (2020), as they provided no tangible value and may have negatively affected the subsequent analysis. The final data frame had 500 observations of 9 variables.
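The cleaning step described above could look roughly like the following. This is a minimal base R sketch rather than the readr and dplyr code used for the report; the inline `raw_csv` text is a toy stand-in for the real file, and its values are illustrative.

```r
# Toy stand-in for the raw file: two password rows plus one trailing blank row.
# Column names follow Table 2.1; the values are illustrative assumptions.
raw_csv <- "rank,password,category,value,time_unit,offline_crack_sec,rank_alt,strength,font_size
1,password,password-related,6.91,years,2.17,1,8,11
2,123456,simple-alphanumeric,18.52,minutes,0.0000111,2,4,8
,,,,,,,,
"

# Treat empty fields as missing so that blank rows become all-NA rows
passwords_raw <- read.csv(text = raw_csv, na.strings = c("", "NA"),
                          stringsAsFactors = FALSE)

# Flag rows in which every column is missing, then drop them
all_missing <- rowSums(!is.na(passwords_raw)) == 0
passwords <- passwords_raw[!all_missing, ]

dim(passwords)
```

Dropping only rows that are missing in every column avoids discarding real observations that happen to have a single missing value.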
Figure 2.3 allows exploration of the top 500 passwords, their popularity ranks, associated password category and their strengths on a scale of 1-10.
Figure 2.3: Top 500 Most Popular Passwords
Purely based on this table, it can be observed that many of the popular passwords are quite simple and contain ordered numbers or alphabetical series. Surprisingly, the word ‘password’ itself holds the number one spot on the list.
Figure 3.1 visualizes the 50 most popular passwords, colored by their respective password category.
Figure 3.1: 50 most popular passwords
From the above plot, it can be seen that many of the top ranked passwords belong to a small subset of categories, as evidenced by the predominance of the brown, purple and blue colors in the wordcloud. This suggests that password popularity may be related to password category.
To explore the relationship between popularity and password category more closely, the entire dataset is used in the analysis below instead of only the 50 most popular passwords.
Figure 3.2 visualizes the proportion of passwords belonging to each category.
Figure 3.2: Proportion of passwords belonging to each category
Figure 3.2 makes it clear that 65% of the passwords belong to one of the three categories ‘name’, ‘cool-macho’ or ‘simple-alphanumeric’, while figure 3.1 shows that the most popular passwords belong to the same categories as well.
In particular, the category ‘name’ dominates all other categories, accounting for over a third of the passwords recorded. This also supports the observation made from the data table that people prefer simple passwords that are easy to remember and hence are ranked as the most popular. This indicates an area of vulnerability, since a name is an easily identifiable piece of information.
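As a rough illustration of how category shares like these can be computed, the base R sketch below tabulates a made-up `category` vector. The ten values are an assumption for demonstration and do not reproduce the proportions in Figure 3.2.

```r
# Made-up category labels for ten passwords (illustrative only)
category <- c("name", "name", "name", "cool-macho", "simple-alphanumeric",
              "sport", "name", "animal", "cool-macho", "food")

# Share of passwords in each category, largest first
share <- sort(prop.table(table(category)), decreasing = TRUE)
round(100 * share, 1)

# Combined share of the three largest categories
top3_share <- sum(head(share, 3))
top3_share
```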
Another area of interest in relation to trends among the most common passwords is password length.
Figure 3.3 is a distribution of passwords based on their length, also broken down by the password category.
Figure 3.3: Distribution of password based on their length
The distribution is slightly left-skewed, with a clear peak at a password length of 6. This indicates that 6 characters is the most common password length, while very few passwords contain 4-5 characters. This may be due to password rules requiring a minimum of 6 characters.
The longest password in the dataset is 9 characters long, while the shortest is only 4. The mean length of all the passwords is 6.2 characters.
Table 3.1 is a summary of the minimum, maximum and mean number of characters in each password category. All categories have a similar mean number of characters and similar minimum and maximum number of characters too. Passwords belonging to all categories have a maximum of 8 characters except for the ‘simple-alphanumeric’ category, which records a 9.
| category | Minimum | Maximum | Mean |
|---|---|---|---|
| animal | 4 | 8 | 6.21 |
| cool-macho | 4 | 8 | 6.25 |
| fluffy | 4 | 8 | 5.80 |
| food | 5 | 8 | 6.09 |
| name | 4 | 8 | 6.22 |
| nerdy-pop | 5 | 8 | 6.63 |
| password-related | 4 | 8 | 6.33 |
| rebellious-rude | 5 | 8 | 6.36 |
| simple-alphanumeric | 4 | 9 | 5.93 |
| sport | 4 | 8 | 6.51 |
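A per-category length summary of this kind can be sketched in base R as follows. The six passwords and their category labels are a toy subset assumed for illustration, so the output will not match Table 3.1.

```r
# Toy subset; the category assignments are assumptions for this sketch
passwords <- data.frame(
  password = c("password", "123456", "qwerty", "dragon", "baseball", "1234"),
  category = c("password-related", "simple-alphanumeric", "simple-alphanumeric",
               "animal", "sport", "simple-alphanumeric"),
  stringsAsFactors = FALSE
)

# Number of characters in each password
passwords$length <- nchar(passwords$password)

# Minimum, maximum and mean length within each category
by_cat <- split(passwords$length, passwords$category)
length_summary <- data.frame(
  category = names(by_cat),
  Minimum  = sapply(by_cat, min),
  Maximum  = sapply(by_cat, max),
  Mean     = round(sapply(by_cat, mean), 2),
  row.names = NULL
)
length_summary
```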
The plots above provide many insights about the similarities among different password categories. Most passwords are simple and contain approximately 6 characters on average, while the most popular password categories are ‘name’, ‘cool-macho’ and ‘simple-alphanumeric’.
The strength of these common passwords is an interesting feature to explore, as each password has been assigned a value in the range of 0-10 relating to its strength. On this scale, 0 indicates low strength and 10 indicates high strength.
Since these are commonly used passwords, they are expected to be weak and easy to crack. The following analysis explores how strong the passwords are, and uses only the time taken to crack each password by offline guessing rather than by online guessing.
Figure 3.4: Proportion of passwords of different strength levels
Figure 3.4 shows that about 3% of the passwords are categorized as very strong, while 43.6% are categorized as strong. A further 35.4% are categorized as medium strength, while 9.2% and 8.8% are categorized as weak and very weak respectively. The very weak category covers passwords of strength 0-4.
The 3% of the top 500 common passwords categorized as very strong were assigned a strength value above 10. Since these values lie well outside the strength range of 1-10, they are excluded from the following analysis.
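The grouping into strength levels and the exclusion of values above 10 could be sketched as below. Only the cut points at 4 and 10 come from the text; the boundaries between weak, medium and strong (6 and 7 here) are assumptions made purely for illustration, as are the strength values themselves.

```r
# Made-up strength values, including two above the 1-10 range
strength <- c(8, 4, 1, 7, 6, 10, 48, 0, 9, 32)

# Cut points: 4 and 10 follow the text; 6 and 7 are ASSUMED for this sketch
breaks <- c(-Inf, 4, 6, 7, 10, Inf)
labels <- c("very weak", "weak", "medium", "strong", "very strong")
level <- cut(strength, breaks = breaks, labels = labels)

# Percentage of passwords at each strength level
round(100 * prop.table(table(level)), 1)

# Keep only passwords within the stated strength range for later analysis
kept <- strength[strength <= 10]
length(kept)
```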
Another important basis for judging the strength of a password is the time taken to crack it. Figure 3.5 shows the distribution of the time to crack for all passwords with a strength of 10 or less. The distribution is strongly right-skewed, with some outliers. In general, the time to crack a password is less than 0.25 seconds, which implies that most passwords are easy to crack. Some passwords take longer, at more than 2 seconds, but they account for only a small share of all passwords. Do the popular passwords fall within this small share? Further exploration is required to identify the relationship between popularity and time to crack.
Figure 3.5: Cracking time distribution of all passwords (strength <= 10)
Table 3.2 lists the time to crack the top 10 most popular passwords, sorted by popularity rank.
| Popularity rank | password | Time to crack (seconds) |
|---|---|---|
| 1 | password | 2.17000 |
| 2 | 123456 | 0.00001 |
| 3 | 12345678 | 0.00111 |
| 4 | 1234 | 0.00000 |
| 5 | qwerty | 0.00321 |
| 6 | 12345 | 0.00000 |
| 7 | dragon | 0.00321 |
| 8 | baseball | 2.17000 |
| 9 | football | 2.17000 |
| 10 | letmein | 0.08350 |
It’s interesting to see that most of the top 10 popular passwords take less than a second to crack, while 3 of them take approximately 2 seconds. However, an argument can be made that the popularity of these passwords is itself the reason they are predictable and therefore easily cracked.
The following analysis identifies which category of passwords is the strongest, in terms of the distribution of password strengths.
Figure 3.6 is a density plot of the distribution of strengths of passwords belonging to each category. The median strength value of each category is marked to improve readability.
Figure 3.6: Distribution of strengths of passwords belonging to each category
The median strengths of passwords in the ‘sport’, ‘nerdy-pop’, ‘name’ and ‘cool-macho’ categories are the highest at 8, while the median strength of passwords in the ‘simple-alphanumeric’ category is the lowest at 1. It can also be seen that the distributions of strengths within the ‘simple-alphanumeric’, ‘password-related’ and ‘food’ categories are multimodal and spread widely across the scale.
For further investigation of which category of passwords is the strongest, the average time to crack a password in each category is calculated and shown in table 3.3. The categories are sorted in ascending order of average time to crack. In line with figure 3.6, the ‘simple-alphanumeric’ category records the lowest time to crack. The ‘rebellious-rude’ category, however, records the longest time to crack, which contrasts with figure 3.6, where it is not among the categories with the highest median strength.
| category | average crack time (seconds) |
|---|---|
| simple-alphanumeric | 0.12 |
| fluffy | 0.16 |
| food | 0.20 |
| animal | 0.24 |
| name | 0.28 |
| nerdy-pop | 0.30 |
| sport | 0.32 |
| password-related | 0.34 |
| cool-macho | 0.35 |
| rebellious-rude | 0.40 |
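An average crack time per category, sorted in ascending order, can be computed along the following lines. This base R sketch uses only the ten passwords and offline crack times from Table 3.2, with category labels that are assumed here for illustration, so its averages differ from those in Table 3.3.

```r
# Offline crack times from Table 3.2; category labels are ASSUMED for this sketch
top10 <- data.frame(
  password = c("password", "123456", "12345678", "1234", "qwerty",
               "12345", "dragon", "baseball", "football", "letmein"),
  category = c("password-related", "simple-alphanumeric", "simple-alphanumeric",
               "simple-alphanumeric", "simple-alphanumeric", "simple-alphanumeric",
               "animal", "sport", "sport", "password-related"),
  offline_crack_sec = c(2.17, 0.00001, 0.00111, 0, 0.00321,
                        0, 0.00321, 2.17, 2.17, 0.0835),
  stringsAsFactors = FALSE
)

# Mean offline crack time per category
avg_crack <- aggregate(offline_crack_sec ~ category, data = top10, FUN = mean)

# Sort from fastest to slowest to crack
avg_crack <- avg_crack[order(avg_crack$offline_crack_sec), ]
avg_crack
```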
Through the above analysis it can be identified that the password categories ‘rebellious-rude’ and ‘cool-macho’ are the strongest while ‘simple-alphanumeric’ is the weakest.
Since the time to crack a password offline was used throughout the above analysis, it would be interesting to analyse the relationship between the online and offline times spent on cracking passwords.
Figure 3.8 plots the online versus offline times spent on cracking passwords, in seconds. Both axes are on log scales to reduce overplotting and make the plot easier to read, and the plot is divided into facets by password category.
Figure 3.8: Online vs offline crack times of different categories of passwords
It’s interesting how linear the relationship between the online and offline crack times is across all password categories. The crack times (both online and offline) of passwords in the simple-alphanumeric category are spread across a wide range of values, while those in the rebellious-rude category are spread across a smaller range.
Figure 3.9 shows the top 10 passwords with the largest difference between online and offline crack time. The red dots represent the offline crack time in seconds, while the black dots represent the online crack time in seconds. All 10 of these passwords record a lower offline crack time than online crack time.
Figure 3.9: Top 10 passwords with the highest difference in online and offline crack times
Passwords belonging to the categories of name, sport and password-related dominate the list of passwords with the highest difference between online and offline crack time. It can be assumed that password categories that make more sense to people are much easier to crack offline. It’s also interesting that the online and offline crack time values of certain passwords are the same, which may lead us to question the accuracy of the dataset used.
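Comparing online and offline crack times in seconds requires converting the online `value` and `time_unit` columns to a common unit first. The base R sketch below shows one way to do this. The conversion factors for months and years (based on a 365.25-day year) and the three example rows are assumptions for illustration, not values confirmed by the report.

```r
# Seconds per unit; the month and year factors are ASSUMED (365.25-day year)
unit_seconds <- c(seconds = 1, minutes = 60, hours = 3600, days = 86400,
                  weeks = 604800, months = 2629800, years = 31557600)

# Illustrative rows using the columns described in Table 2.1
online <- data.frame(
  password = c("password", "123456", "letmein"),
  value = c(6.91, 18.52, 3.19),
  time_unit = c("years", "minutes", "months"),
  offline_crack_sec = c(2.17, 0.0000111, 0.0835),
  stringsAsFactors = FALSE
)

# Online crack time in seconds, then the gap between online and offline
online$online_crack_sec <- online$value * unname(unit_seconds[online$time_unit])
online$diff_sec <- online$online_crack_sec - online$offline_crack_sec

# Largest differences first
online[order(-online$diff_sec), ]
```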
Through the exploratory data analysis of the dataset on the top 500 commonly used passwords, it was observed that most people tend to choose passwords that can be easily remembered. A typical choice is a simple password that is related to a name or contains alphanumeric characters and is roughly 6-7 characters long. On further exploration it was found that 43.6% of the commonly used passwords are relatively high in strength and that around 3% of the passwords were of very high strength, which varied greatly from typical passwords.
Furthermore, it was observed that among the password categories, ‘rebellious-rude’ and ‘cool-macho’ are considered strong and take relatively more time to crack. Another striking discovery is that the crack time and the strength of the passwords in the dataset do not follow any strict relationship. Not all passwords with high strength take long to crack, and not all passwords with low strength are cracked easily: there were instances of a high-strength password being cracked more quickly than a low-strength one.
It can be concluded that most people choose common passwords that can be easily hacked and that using any of the passwords in the dataset is not recommended.
Aden-Buie, Garrick. 2020. Ggpomological: Pomological Plot Themes for Ggplot2. https://github.com/gadenbuie/ggpomological.
Cheng, Joe. 2020. Crosstalk: Inter-Widget Interactivity for Html Widgets. https://CRAN.R-project.org/package=crosstalk.
Cheng, Joe, Carson Sievert, Winston Chang, Yihui Xie, and Jeff Allen. 2020. Htmltools: Tools for Html. https://CRAN.R-project.org/package=htmltools.
“Elegant Visualization of Density Distribution in R Using Ridgeline - Datanovia.” 2020. https://www.datanovia.com/en/blog/elegant-visualization-of-density-distribution-in-r-using-ridgeline/.
Fellows, Ian. 2018. Wordcloud: Word Clouds. https://CRAN.R-project.org/package=wordcloud.
McCandless, D. 2020. “Knowledge Is Beautiful, My New Book - Information Is Beautiful.” http://www.informationisbeautiful.net/2014/knowledge-is-beautiful/.
Winder, D. 2020. “New Dark Web Audit Reveals 15 Billion Stolen Logins from 100,000 Breaches.” https://www.forbes.com/sites/daveywinder/2020/07/08/new-dark-web-audit-reveals-15-billion-stolen-logins-from-100000-breaches-passwords-hackers-cybercrime/#344d620180fb.
Wood, R. 2020. “Password Analyser - Digininja.” https://digi.ninja/projects/pipal.php.
“Passwords - Skullsecurity.” 2020. https://wiki.skullsecurity.org/Passwords.
“Pie Charts.” 2020. https://plotly.com/r/pie-charts/.
Mock, T. 2020. “Rfordatascience/Tidytuesday.” https://github.com/rfordatascience/tidytuesday/tree/master/data/2020/2020-01-14.
Sievert, Carson. 2020. Interactive Web-Based Data Visualization with R, Plotly, and Shiny. Chapman; Hall/CRC. https://plotly-r.com.
Tierney, Nicholas. 2017. “Visdat: Visualising Whole Data Frames.” JOSS 2 (16): 355. https://doi.org/10.21105/joss.00355.
Tierney, Nicholas, Di Cook, Miles McBain, and Colin Fay. 2020. Naniar: Data Structures, Summaries, and Visualisations for Missing Data. https://CRAN.R-project.org/package=naniar.
Wickham, Hadley. 2016a. Ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. https://ggplot2.tidyverse.org.
———. 2016b. Ggplot2: Elegant Graphics for Data Analysis. springer.
Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019. “Welcome to the tidyverse.” Journal of Open Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.
Wickham, Hadley, Jim Hester, and Romain Francois. 2018. Readr: Read Rectangular Text Data. https://CRAN.R-project.org/package=readr.
Wickham, Hadley, and Dana Seidel. 2020. Scales: Scale Functions for Visualization. https://CRAN.R-project.org/package=scales.
Wilke, Claus O. 2020. Ggridges: Ridgeline Plots in ’Ggplot2’. https://CRAN.R-project.org/package=ggridges.
Xie, Yihui, Joe Cheng, and Xianying Tan. 2020. DT: A Wrapper of the Javascript Library ’Datatables’. https://CRAN.R-project.org/package=DT.
Zhu, Hao. 2019. KableExtra: Construct Complex Table with ’Kable’ and Pipe Syntax. https://CRAN.R-project.org/package=kableExtra.